Star Wars Predictive Analytics

Author

Brock

Show the code
import pandas as pd 
import numpy as np
from lets_plot import *
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.metrics import *

LetsPlot.setup_html(isolated_frame=True)

Introduction

Can Star Wars-related survey responses be used to predict whether a person earns more than $50,000 annually?

Data Loading & Missing Value Analysis

Show the code
# First 2 rows are column headers, created a map to make it easy to read
new_col_names = ["id", "seen_any", "fan_starwars", 
"seen_Ep_I", "seen_Ep_II", "seen_Ep_III", "seen_Ep_IV", "seen_Ep_V", "seen_Ep_VI", 
"rank_Ep_I", "rank_Ep_II", "rank_Ep_III", "rank_Ep_IV", "rank_Ep_V", "rank_Ep_VI", 
"fav_han", "fav_luke", "fav_leia", "fav_anakin", "fav_obi", "fav_palpatine", "fav_darth", "fav_lando", "fav_boba", "fav_c3po", "fav_r2", "fav_jar", "fav_padme", "fav_yoda", 
"who_shot_first", "familiar_expanded_universe", "fan_expanded_universe", "fan_startrek", "gender", "age", "income", "educ", "region"] 

# Read the CSV from GitHub into a pandas DataFrame
df = pd.read_csv("https://github.com/fivethirtyeight/data/raw/master/star-wars-survey/StarWars.csv", encoding_errors="ignore", names = new_col_names, skiprows = 2)

# Replace common missing indicators with np.nan
df = df.replace(['NA', 'N/A', 'na', 'n/a', '', 'NaN', 'nan', 999, -99, None], np.nan)

# Display a glimpse of the data
display(df.head())
id seen_any fan_starwars seen_Ep_I seen_Ep_II seen_Ep_III seen_Ep_IV seen_Ep_V seen_Ep_VI rank_Ep_I ... fav_yoda who_shot_first familiar_expanded_universe fan_expanded_universe fan_startrek gender age income educ region
0 3292879998 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 3.0 ... Very favorably I don't understand this question Yes No No Male 18-29 NaN High school degree South Atlantic
1 3292879538 No NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN Yes Male 18-29 $0 - $24,999 Bachelor degree West South Central
2 3292765271 Yes No Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith NaN NaN NaN 1.0 ... Unfamiliar (N/A) I don't understand this question No NaN No Male 18-29 $0 - $24,999 High school degree West North Central
3 3292763116 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 5.0 ... Very favorably I don't understand this question No NaN Yes Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central
4 3292731220 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 5.0 ... Somewhat favorably Greedo Yes No No Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central

5 rows × 38 columns

The dataset now loads cleanly into a DataFrame with 38 columns covering Star Wars movie viewership, rankings, favorite characters, and demographic information (gender, age, income, education, region).

From here, I’ll inspect the structure, data types, and missing values to guide the direction of this project.

Missing Value Analysis

The columns with the highest percentage of missing values reveal clear patterns in survey behavior:

Show the code
# Get data types for internal use
datatypes = df.dtypes

# Percent missing values of each column
missing = (df.isna().sum() / len(df) * 100).round(1)
display(f"Top 10 Missing Values Percentages of Columns")
display(missing.sort_values(ascending=False).head(10))  # Show top 10
'Top 10 Missing Values Percentages of Columns'
fan_expanded_universe    82.0
seen_Ep_III              53.6
seen_Ep_II               51.9
seen_Ep_IV               48.8
seen_Ep_I                43.3
seen_Ep_VI               37.8
seen_Ep_V                36.1
fav_boba                 31.5
fav_palpatine            31.4
fav_padme                31.4
dtype: float64

Given the extent of missing values, I will filter to respondents who have seen at least one movie. I will also drop the fan_expanded_universe column, since no insights can come from a column that is mostly missing.

Show the code
# Filter to respondents who listed at least one movie they have seen.
# The seen_any column is unreliable: many who answered 'Yes' left all
# of the individual movie fields blank.
df = df.dropna(subset=df.columns[3:9], how='all')

# Remove all rows where income (target variable) is NA
df = df.dropna(subset=['income'])

# Drop fan_expanded_universe column
df = df.drop(['fan_expanded_universe'], axis=1)

# Select the six seen_Ep_* columns
cols = df.columns[3:9]

# Replace NaN with "no", everything else with "yes"
df[cols] = np.where(df[cols].notna(), 'yes', 'no')

After filtering out those who have not seen any Star Wars movie and those who did not list their income level, the total number of respondents fell from 1186 to 675. Keeping those rows would defeat the purpose of the project, which is to use Star Wars survey responses to predict income level: it would not make sense to use responses from people who have not seen at least one movie or who did not list their income.

One issue I foresee is the data getting smaller and smaller, though it is still a decent size; the filtering removed about 43% of the respondents.

Columns indicating whether the respondent is a fan, or whether they have seen certain movies, also have a lot of NAs. Further investigation reveals these questions offer only one option to choose, so I will interpret blank answers as meaning the respondent has not seen that movie.

Show the code
# Percent missing values of each column, filtered by those have have seen one movie at least
missing = (df.isna().sum() / len(df) * 100).round(1)
display(f"Top 10 Missing Values Percentages of Columns")
display(missing.sort_values(ascending=False).head(10))  # Show top 10
'Top 10 Missing Values Percentages of Columns'
fav_palpatine    2.5
fav_boba         2.5
fav_padme        2.4
fav_jar          1.9
fav_lando        1.8
fav_anakin       1.6
fav_obi          1.3
fav_darth        1.2
fav_c3po         1.2
fav_yoda         1.2
dtype: float64

As you can now see, the dataset is much cleaner with respect to missing values.

The next step will be to examine the nature of the target variable and key predictors in order to determine which model(s) will be best suited for this project.

Exploratory Data Analysis

Show the code
# Grab value counts and Get proportion of column it represents
income_counts = df['income'].value_counts(dropna=False)
income_props = df['income'].value_counts(normalize=True, dropna=False).round(2)

# Display the counts and proportions as a table
display(pd.DataFrame({'Count': income_counts, 'Proportion': income_props}))

# Order income for graph map
income_order = ['$0 - $24,999', '$25,000 - $49,999', '$50,000 - $99,999','$100,000 - $149,999','$150,000+']

# Convert to ordered categorical
df['income'] = pd.Categorical(df['income'], categories=income_order, ordered=True)

display(ggplot(df, aes(x='income')) +\
    geom_bar(fill="#4C72B0") +\
    ggtitle("Income Distribution") +\
    xlab("Income Category") +\
    ylab("Number of Respondents") +
    theme(axis_text_x=element_text(angle=30, hjust=1)))
Count Proportion
income
$50,000 - $99,999 238 0.35
$25,000 - $49,999 147 0.22
$100,000 - $149,999 115 0.17
$0 - $24,999 98 0.15
$150,000+ 77 0.11

The income variable contains five categories, and the distribution shows two important points:

The $50,000 - $99,999 level dominates, with about 35% of respondents falling in it. The remaining levels range between 11% and 22%. The distribution is somewhat imbalanced, but we should be able to tweak some things depending on the model route I choose.

Since five ordered categories are the wrong format for this analysis, I will bin income at $50k or more. We end up with 64% of answers being $50k or more, so we still have an imbalanced dataset.

Show the code
# Convert Income ranges to values 0 for less than 50k
income_map = {
    '$0 - $24,999': 0,
    '$25,000 - $49,999': 0,
    '$50,000 - $99,999': 1,
    '$100,000 - $149,999': 1,
    '$150,000+': 1
}
# Map each income category to its binary label
df['income'] = df['income'].map(income_map)

# Grab value counts and Get proportion of column it represents
income_counts = df['income'].value_counts(dropna=False)
income_props = df['income'].value_counts(normalize=True, dropna=False).round(2)

# Display the counts and proportions as a table
display(pd.DataFrame({'Count': income_counts, 'Proportion': income_props}))
Count Proportion
income
1 430 0.64
0 245 0.36

I will now convert string-ordinal columns to numeric-ordinal columns for the purposes of visualization, correlation analysis, and preparing for modeling.

Show the code
# Convert age to numeric using the midpoint of each range; 66 for >60 because 72 is the average life expectancy
age_map = {'18-29': 23, '30-44': 37, '45-60': 52.5, '> 60': 66}
# Replace
df['age'] = df['age'].map(age_map)

# Preserve order in fav columns
fav_order = [
    'Very unfavorably',
    'Somewhat unfavorably',
    'Neither favorably nor unfavorably (neutral)',
    'Unfamiliar (N/A)',
    'Somewhat favorably',
    'Very favorably'
]

for col in df.columns[15:29]:
    df[col] = pd.Categorical(df[col], categories=fav_order, ordered=True)

# Convert fav_ columns to a numeric rank
fav_map = {
  'Very favorably': 2,
  'Somewhat favorably': 1,
  'Neither favorably nor unfavorably (neutral)': 0,
  'Unfamiliar (N/A)': 0,
  'Somewhat unfavorably': -1,
  'Very unfavorably': -2
}
# Cast back to object first so values outside the declared categories are allowed
fav_cols = df.columns[15:29]
df[fav_cols] = df[fav_cols].astype(object).replace(fav_map)
Show the code
# Select ranking columns by name pattern; use a nullable integer dtype for correlation later
rank_cols = [col for col in df.columns if 'rank_' in col]
df[rank_cols] = df[rank_cols].astype('Int64')

Now, I will show the variables most correlated with income, whether positively or negatively.

Show the code
# Select all numeric columns
numeric_cols = [col for col in df.columns if df[col].dtype in ['int64', 'int8', 'Int64', 'Int8', 'float64']]

# Calculate correlations with income
correlations = df[numeric_cols + ['income']].corrwith(df['income'])
correlations = correlations.drop('income').sort_values(key=abs, ascending=False)

print("Top 5 correlations with income (>$50k):")
display((correlations.head(5)))
print("\nBottom 5 correlations:")
display(correlations.tail(5))

df['income'] = df['income'].astype('bool')
Top 5 correlations with income (>$50k):
age            0.118288
rank_Ep_II     0.107559
rank_Ep_IV    -0.080947
rank_Ep_III    0.070549
rank_Ep_VI    -0.064020
dtype: float64

Bottom 5 correlations:
rank_Ep_III    0.070549
rank_Ep_VI    -0.064020
rank_Ep_V     -0.054163
rank_Ep_I      0.053673
id             0.017514
dtype: float64

From the correlations above, age is the strongest single correlate of income at roughly 12%, about what most would have guessed. The episode-ranking variables all sit below that, at |r| ≤ 0.11.

The main standout is that no single variable is worth exploring as a predictor of income by itself; an interaction of multiple variables might be worth pursuing instead.
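One caveat worth checking: if the mapped fav_ columns remain object dtype after .replace, a numeric-only column scan will silently skip them in the correlation step. A minimal sketch, using toy values standing in for a mapped favorability column (not the survey data), of coercing with pd.to_numeric before correlating:

```python
import pandas as pd

# Toy stand-in for a mapped fav_ column that is still object dtype.
# A dtype-based numeric scan would skip it until it is coerced.
toy = pd.DataFrame({"fav_han": [2, 1, 0, -1], "income": [1, 1, 0, 0]}, dtype=object)
toy = toy.apply(pd.to_numeric)  # coerce every column to a numeric dtype
print(round(toy["fav_han"].corr(toy["income"]), 2))  # → 0.89
```

On the real frame, the same coercion on the fav_ columns would let them appear in the corrwith output alongside age and the rank columns.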

In the coming graphs, I am looking for differences in the distributions between the levels of two variables, to assess whether there is a joint relationship between them and income. I will only display typical choices and ones that are interesting.

Show the code
(
    ggplot(df, aes(x="age", fill="fav_leia")) +
    geom_bar(position="dodge") +
    facet_wrap("income") +
    labs(
        title="Leia Favorability by Age and Income",
        x="Age", y="Count", fill="Leia favorability (2 = very, 1 = somewhat)"
    ) +
    theme(legend_position = "none")

)
Show the code
(
    ggplot(df, aes(x="age", fill="fav_obi")) +
    geom_bar(position="dodge") +
    facet_wrap("income") +
    labs(
        title="Obi-Wan Favorability by Age and Income",
        x="Age", y="Count", fill="Obi-Wan favorability (2 = very, 1 = somewhat)"
    ) +
    theme(legend_position = "none")

)
Show the code
(
    ggplot(df, aes(x="age", fill="fav_han")) +
    geom_bar(position="dodge") +
    facet_wrap("income") +
    labs(
        title="Han Solo Favorability by Age and Income",
        x="Age", y="Count", fill="Han favorability (2 = very, 1 = somewhat)"
    ) +
    theme(legend_position = "none")

)
Show the code
(
    ggplot(df, aes(x="age", fill="educ")) +
    geom_bar(position="dodge") +
    facet_wrap("income") +
    labs(
        title="Education by Age and Income",
        x="Age", y="Count", fill="Education"
    ) +
    theme(legend_position = "none")

)
Show the code
(
    ggplot(df, aes(x="age", fill="gender")) +
    geom_bar(position="dodge") +
    facet_wrap("income") +
    labs(
        title="Gender by Age and Income",
        x="Age", y="Count", fill="Gender"
    ) +
    theme(legend_position = "none")

)
Show the code
(
    ggplot(df, aes(x="age", fill="who_shot_first")) +
    geom_bar(position="dodge") +
    facet_wrap("income") +
    labs(
        title="'Who Shot First?' by Age and Income",
        x="Age", y="Count", fill="Who shot first?"
    ) +
    theme(legend_position = "none")

)
Show the code
top_regions = df['region'].value_counts().nlargest(6).index
df2 = df  # note: plain assignment, not a copy — region_plot is added to df as well
df2['region_plot'] = np.where(df2['region'].isin(top_regions), df2['region'], 'Other')

(
    ggplot(df2, aes(x="region_plot")) +
    geom_bar() +
    coord_flip() +
    facet_grid(y='income', x='gender') +
    labs(title="Region Distribution by Income and Gender", x="Region", y="Count")
)
Show the code
display(ggplot(df, aes("region")) +\
  geom_bar() +\
  coord_cartesian(ylim=(0,30)) +\
  facet_wrap(facets=["income", "age"]))

As I explored multi-layered relationships, a few interesting things occurred:

The interaction between gender and education shows differing distributions, perhaps uncovering a good predictor of income.

The interaction between age and education also produces differing distributions.

All in all, the distributions are more similar than different across almost all interactions explored.
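The visual comparisons above can also be checked numerically. A minimal sketch, using a toy frame with hypothetical values rather than the survey data, computing the share of high earners in each gender × education cell with pd.crosstab:

```python
import pandas as pd

# Toy stand-in for the survey (hypothetical values): the mean of the binary
# income flag within each gender x education cell is the share earning $50k+.
toy = pd.DataFrame({
    "gender": ["Male", "Male", "Female", "Female", "Male", "Female"],
    "educ":   ["HS",   "BA",   "HS",     "BA",     "BA",   "HS"],
    "income": [0,       1,      0,        1,        1,      1],
})
rates = pd.crosstab(toy["gender"], toy["educ"], values=toy["income"], aggfunc="mean")
print(rates)
```

Large gaps between cells in such a table would support the visual impression that the gender × education interaction carries signal about income.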

Therefore, I will entertain multiple classifiers: Random Forest, Gradient Boosting, Decision Tree, K-Nearest Neighbors, and Gaussian Naive Bayes. This project is limited by my knowledge, which includes only these ML algorithms as of today.

Model Prep

Show the code
# Encode education as integer category codes (0, 1, 2, ...)
df['educ'] = df['educ'].astype('category')
df['educ'] = pd.factorize(df['educ'])[0]
df['educ'] = df['educ'].astype('category')

# Drop ID column
df1 = df.drop(df.columns[0], axis=1)

# Separate columns to exclude from one-hot encoding
encode_mask = df1.columns.str.contains(r'fav_|income|educ|age', case=False)
new_encode = df1.loc[:, ~encode_mask].astype("category")
old_encode = df1.loc[:, encode_mask]

# One-hot encode the remaining categorical/object columns
sw_encoded = pd.get_dummies(new_encode, dummy_na=True, prefix_sep='_', drop_first=False, dtype=int)
sw_encoded.columns = sw_encoded.columns.str.replace(' ', '_')

# Concatenate back the excluded columns
sw = pd.concat([df.iloc[:,0], sw_encoded, old_encode], axis=1)

# Drop rows with 7 or more nan
sw = sw[sw.isna().sum(axis=1) <= 7]

# Drop rows with nan in outcome variable
sw = sw.dropna(subset=['income'])

# drop ID column
sw = sw.drop('id', axis=1)

To begin, I transformed age into numeric midpoints, mapping respondents who reported > 60 to 66. The income variable was recoded into a binary classification target, where:

\texttt{income} = \begin{cases} 0 & \text{if income } < \$50{,}000 \\ 1 & \text{if income } \geq \$50{,}000 \end{cases}

Favorability ratings for characters (columns prefixed with fav_) were mapped to a numerical scale from -2 (very unfavorable) to +2 (very favorable). I excluded the already-numeric ordinal columns (fav_, income, educ, and age) from one-hot encoding, then encoded the remaining categorical variables using pd.get_dummies() with dummy_na=True. I then dropped rows with 7 or more missing values, as well as any rows missing the target variable.
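The dummy_na=True behavior is worth illustrating: a missing answer gets its own indicator column rather than being dropped. A toy sketch (hypothetical values, not the survey data):

```python
import pandas as pd

# Toy column mirroring a survey question with a missing response.
toy = pd.DataFrame({"fan_startrek": ["Yes", "No", None, "Yes"]})

# dummy_na=True adds a dedicated column for NaN, so missingness itself
# becomes a feature the models can use.
encoded = pd.get_dummies(toy, dummy_na=True, prefix_sep="_", dtype=int)
print(list(encoded.columns))
# → ['fan_startrek_No', 'fan_startrek_Yes', 'fan_startrek_nan']
```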

Show the code
display(f"{len(df1)} Observations")
display(f"Percent of Observations Missing")
display((df1.isna().sum() / len(df1) * 100).sort_values(ascending=False))
'675 Observations'
'Percent of Observations Missing'
fav_palpatine                 2.518519
fav_boba                      2.518519
fav_padme                     2.370370
fav_jar                       1.925926
fav_lando                     1.777778
fav_anakin                    1.629630
fav_obi                       1.333333
fav_c3po                      1.185185
fav_darth                     1.185185
fav_yoda                      1.185185
fav_han                       0.740741
fav_r2                        0.740741
fav_leia                      0.444444
fav_luke                      0.444444
region                        0.296296
rank_Ep_III                   0.148148
rank_Ep_I                     0.148148
seen_any                      0.000000
seen_Ep_II                    0.000000
seen_Ep_I                     0.000000
fan_starwars                  0.000000
seen_Ep_III                   0.000000
rank_Ep_VI                    0.000000
seen_Ep_V                     0.000000
rank_Ep_II                    0.000000
seen_Ep_VI                    0.000000
rank_Ep_V                     0.000000
rank_Ep_IV                    0.000000
seen_Ep_IV                    0.000000
who_shot_first                0.000000
familiar_expanded_universe    0.000000
gender                        0.000000
fan_startrek                  0.000000
age                           0.000000
income                        0.000000
educ                          0.000000
region_plot                   0.000000
dtype: float64

Machine Learning Model Introduction

After preprocessing, I assigned the predictors to x and the binary target variable to y = income. The dataset was split into training and test sets using a 75/25 ratio. Missing values in x were filled with 0 for models that require complete input (GradientBoostingClassifier, GaussianNB, and KNeighborsClassifier), or left as-is where the model tolerates them.

I trained and evaluated the following classifiers:

  • RandomForestClassifier
  • GradientBoostingClassifier
  • DecisionTreeClassifier
  • GaussianNB (Naive Bayes)
  • KNeighborsClassifier

Each model’s performance was evaluated using accuracy_score and classification_report from sklearn.metrics.

Show the code
# Assign Predictors and outcome variables
x = sw.drop(columns='income')
y = sw['income'].astype(int)
Show the code
# Split dataset into training and testing sets

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=50)

# Fill NAs for Gradient Boosting, Gaussian NB, and K-Neighbors
x_train_na = x_train.fillna(0)
x_test_na = x_test.fillna(0)
Show the code
# Train Random Forest Classifier
classifier_DT = RandomForestClassifier(
    random_state=70,
    max_depth=12

)
classifier_DT.fit(x_train, y_train)

# Train Gradient Boosting Classifier
classifier_DT2 = GradientBoostingClassifier(
  random_state=70,
  learning_rate=.01,
  max_depth=7

)
classifier_DT2.fit(x_train_na, y_train)

# Train Decision Tree Classifier
classifier_DT3 = DecisionTreeClassifier(
    random_state=70,
    max_depth=6
)
classifier_DT3.fit(x_train, y_train)

# Train Gaussian NB Classifier
classifier_DT4 = GaussianNB()
classifier_DT4.fit(x_train_na, y_train)

# KN
classifier_DT5 = KNeighborsClassifier(n_neighbors=25,
    p=2                  # Power parameter (1=Manhattan, 2=Euclidean)
)
classifier_DT5.fit(x_train_na, y_train)
KNeighborsClassifier(n_neighbors=25)
Show the code
y_pred = classifier_DT.predict(x_test)
print("Random Forest:")
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

y_pred2 = classifier_DT2.predict(x_test_na)
print("Gradient Boosting:")
print("Accuracy:", accuracy_score(y_test, y_pred2))
print(classification_report(y_test, y_pred2))
Random Forest:
Accuracy: 0.6726190476190477
              precision    recall  f1-score   support

           0       0.57      0.29      0.38        59
           1       0.70      0.88      0.78       109

    accuracy                           0.67       168
   macro avg       0.63      0.58      0.58       168
weighted avg       0.65      0.67      0.64       168

Gradient Boosting:
Accuracy: 0.6369047619047619
              precision    recall  f1-score   support

           0       0.47      0.29      0.36        59
           1       0.68      0.83      0.75       109

    accuracy                           0.64       168
   macro avg       0.58      0.56      0.55       168
weighted avg       0.61      0.64      0.61       168
Show the code
y_pred3 = classifier_DT3.predict(x_test)
print("Decision Tree:")
print("Accuracy:", accuracy_score(y_test, y_pred3))
print(classification_report(y_test, y_pred3))
Decision Tree:
Accuracy: 0.6190476190476191
              precision    recall  f1-score   support

           0       0.43      0.27      0.33        59
           1       0.67      0.81      0.73       109

    accuracy                           0.62       168
   macro avg       0.55      0.54      0.53       168
weighted avg       0.59      0.62      0.59       168
Show the code
y_pred4 = classifier_DT4.predict(x_test_na)
print("Gaussian NB:")
print("Accuracy:", accuracy_score(y_test, y_pred4))
print(classification_report(y_test, y_pred4))

y_pred5 = classifier_DT5.predict(x_test_na)
print("K-Neighbors:")
print("Accuracy:", accuracy_score(y_test, y_pred5))
print(classification_report(y_test, y_pred5))
Gaussian NB:
Accuracy: 0.6071428571428571
              precision    recall  f1-score   support

           0       0.44      0.41      0.42        59
           1       0.69      0.72      0.70       109

    accuracy                           0.61       168
   macro avg       0.56      0.56      0.56       168
weighted avg       0.60      0.61      0.60       168

K-Neighbors:
Accuracy: 0.6547619047619048
              precision    recall  f1-score   support

           0       0.52      0.24      0.33        59
           1       0.68      0.88      0.77       109

    accuracy                           0.65       168
   macro avg       0.60      0.56      0.55       168
weighted avg       0.62      0.65      0.61       168
Show the code
# Collect the accuracies computed above for comparison
model_scores = pd.DataFrame({
    'Model': ['Random Forest', 'Gradient Boosting', 'Decision Tree', 'Gaussian NB', 'K-Neighbors'],
    'Accuracy': [accuracy_score(y_test, y_pred),
                 accuracy_score(y_test, y_pred2),
                 accuracy_score(y_test, y_pred3),
                 accuracy_score(y_test, y_pred4),
                 accuracy_score(y_test, y_pred5)]
})

ggplot(model_scores, aes(x='Model', y='Accuracy', fill='Model')) + \
    geom_bar(stat='identity', size=7) + \
    coord_flip() +\
    labs(title='Model Accuracy Comparison', y='Accuracy') + \
    theme_bw() + \
    theme(
        axis_text_x=element_text(hjust=.5, size=12),
        axis_text_y=element_text(size=12),
        plot_title=element_text(size=16, face='bold', hjust=0.5),
        legend_position='none'
    )

Model Performance Analysis

Random Forest achieved the highest overall accuracy at 67.3%, with strong recall (88%) and F1-score (0.78) for predicting individuals earning at least $50k. However, it performed poorly on the lower-income class, with an F1-score of 0.38 and 29% recall.

Gradient Boosting followed with an accuracy of 63.7%. Its precision, recall, and F1 follow the same pattern as Random Forest: strong on the positive class, weak on the negative one.

Decision Tree (61.9% accuracy) follows the same pattern overall as the Random Forest and Gradient Boosting classifiers.

Gaussian Naive Bayes reached 60.7% accuracy, the lowest of the group, but it was the most balanced across classes, with 41% recall on the lower-income class.

K-Neighbors follows a relatively similar pattern to the tree models, with 65.5% accuracy and only 24% recall on the lower-income class.

Conclusion and Recommendation

While Gaussian Naive Bayes offered the most balanced performance across both classes, I believe the Random Forest Classifier is the better choice in this context. Since the primary goal is to accurately predict whether someone earns more than $50,000, overall accuracy and precision for the high-income class matter more than recall for the low-income group. Random Forest achieved the highest accuracy and the strongest precision for class 1, making it more effective at identifying higher earners, which aligns directly with the objective of this analysis.
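Given the weak minority-class recall noted above, one possible next step would be class_weight='balanced', which reweights classes during Random Forest training. A sketch on synthetic data mimicking the survey's roughly 36/64 split (not the survey itself; results will differ on the real features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the ~36/64 class imbalance in the income target.
X, y = make_classification(n_samples=800, weights=[0.36, 0.64], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

plain = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
weighted = RandomForestClassifier(class_weight='balanced', random_state=0).fit(X_tr, y_tr)

# Recall on the minority class (label 0, the low-income analogue)
plain_rec = recall_score(y_te, plain.predict(X_te), pos_label=0)
bal_rec = recall_score(y_te, weighted.predict(X_te), pos_label=0)
print(plain_rec, bal_rec)
```

Reweighting typically trades a little overall accuracy for better minority recall, so it is worth testing only if catching low-income respondents becomes a goal.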

Feature Importance in Random Forest Results

After training the RandomForestClassifier, I examined which features contributed most to the model’s predictions.

Using the feature_importances_ attribute of the trained model, I identified the top predictors for determining whether a respondent earns over $50,000. These features reflect a combination of demographic factors and Star Wars-related preferences that the model found most useful for predicting income level.

Here are the top 10 most important features:

Show the code
# Get top 10 features by importance
importances = classifier_DT.feature_importances_
features = x_train.columns

importance_df = pd.DataFrame({'Feature': features, 'Importance': importances})
top_features = importance_df.sort_values(by='Importance', ascending=False).head(10)

ggplot(top_features, aes(x='Feature', y='Importance', fill='Feature')) + \
    geom_bar(stat='identity', color='black', size=0.4) + \
    ggtitle('Feature Importance - Random Forest') + \
    xlab('Feature') + \
    ylab('Importance') + \
    theme_minimal() + \
    theme(
        axis_text_x=element_text(angle=45, hjust=1, size=12),
        axis_text_y=element_text(size=12),
        plot_title=element_text(size=16, face='bold', hjust=0.5),
        legend_position='none'
    )

In the Random Forest model, feature importance was more evenly distributed across a mix of predictors. Among the features, age and educ were most important, which aligns with expected links between demographics and income. Several Star Wars favorability ratings — including fav_jar, fav_darth, fav_anakin, and fav_padme — also contributed meaningfully. This broader distribution of importance suggests the model draws on a variety of demographic and preference-based factors.
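A useful cross-check on impurity-based feature_importances_, which can favor high-cardinality features, is scikit-learn's permutation importance. A minimal sketch on synthetic data (not the survey):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Permutation importance measures the score drop when each feature is
# shuffled — a model-agnostic alternative to impurity-based importances.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)
result = permutation_importance(clf, X, y, n_repeats=5, random_state=0)
print(result.importances_mean)  # one mean importance per feature
```

If the permutation ranking broadly agreed with the feature_importances_ ranking shown above, that would add confidence that age and educ genuinely drive the predictions.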

In conclusion, it seems that Star Wars-related preferences, even combined with education and age, are not very powerful predictors of whether income exceeds $50k.